A Complete Journey from Basics to Advanced Concepts
Think of an LLM like a master chef in a restaurant. When you give them a recipe (prompt), they don't cook the entire meal at once. Instead, they prepare it one ingredient at a time, tasting and adjusting as they go. Each "ingredient" is like a token (word or part of a word) that the model generates one by one.
Tokens are the basic units that LLMs understand - they can be words, parts of words, or even punctuation!
Prefill: Process the entire prompt in parallel
Fast but compute-intensive
Decode: Generate tokens one by one
Slow and memory-bandwidth-bound
Prefill is like speed-reading an entire book (your prompt) very quickly to understand the context.
Decode is like writing a response letter word by word, thinking carefully about each word before writing the next one.
🚨 The Big Problem: Each new token depends on ALL previous tokens!
This creates a memory bottleneck and limits how fast we can generate text.
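To make the two phases concrete, here is a minimal sketch of the naive autoregressive loop (no caching yet) using Hugging Face transformers, with a small placeholder model and greedy decoding. The first forward pass over the whole prompt is the prefill; every later iteration re-processes the entire sequence just to produce one more token.

```python
# Naive autoregressive generation: every step re-runs the model over the full,
# growing sequence. The first pass over the whole prompt is the "prefill";
# the later single-token steps are the "decode" phase.
# "gpt2" and the prompt are placeholders purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(10):  # generate 10 tokens
    with torch.no_grad():
        logits = model(input_ids).logits            # re-processes ALL tokens so far
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```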
Imagine you're in a study group discussing a complex topic. Instead of re-reading the entire textbook every time someone asks a question, you keep detailed notes (KV Cache) of everything discussed so far. When a new question comes up, you can quickly reference your notes instead of starting from scratch!
Keys (K): Help the model find relevant information
Values (V): Store the actual information content
Each token gets its own Key-Value pair stored in memory
Without KV cache:
Token 1: Process [The]
Token 2: Process [The, cat] → Recalculate everything!
Token 3: Process [The, cat, sat] → Recalculate everything again!
With KV cache:
Token 1: Process [The] → Store K1,V1
Token 2: Use K1,V1 + Process [cat] → Store K2,V2
Token 3: Use K1,V1,K2,V2 + Process [sat] → Store K3,V3
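In code, this cached pattern maps directly onto the `past_key_values` mechanism in Hugging Face transformers. A minimal sketch, again using a small placeholder model and greedy decoding:

```python
# Same generation loop as before, but now each decode step feeds ONLY the
# newest token and reuses the stored keys/values of all earlier tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The cat sat", return_tensors="pt").input_ids

# Prefill: build K,V for the whole prompt in one pass and keep them.
with torch.no_grad():
    out = model(input_ids, use_cache=True)
past_key_values = out.past_key_values
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

# Decode: each step processes a single token and appends its K,V to the cache.
tokens = [next_token]
for _ in range(10):
    with torch.no_grad():
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values                    # cache grows by one entry
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    tokens.append(next_token)

print(tokenizer.decode(torch.cat(tokens, dim=-1)[0]))
```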
🔥 For a 70B model like Llama 3.3:
• Each token needs ~800KB of KV cache
• A 2048-token conversation = 1.6GB just for cache!
• This grows linearly with context length
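The 1.6GB figure is just the per-token cost multiplied out. A quick sanity check, taking the ~800KB/token figure above as given (the exact per-token value depends on the model's layer count, number of KV heads, head size, and precision):

```python
# Back-of-the-envelope KV-cache sizing. General formula:
#   bytes_per_token = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_element
# Here we simply take the ~800 KB/token figure from the text as given.
kv_bytes_per_token = 800 * 1024          # ~800 KB per token (assumed, see text)
context_len = 2048

total_gb = kv_bytes_per_token * context_len / 1024**3
print(f"~{total_gb:.1f} GB of KV cache for a {context_len}-token context")  # ~1.6 GB
```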
Think of KV cache like a growing library. Each new book (token) you add needs shelf space (memory). As your library grows, you need more and more shelves. Eventually, you run out of space and need clever storage solutions!
Quantization: Compress the cache data from 16-bit to 8-bit or even 4-bit numbers (sketched after this list). Less accurate but much smaller!
Paging: Break the cache into small "pages," like virtual memory, so space isn't wasted on unused slots!
Offloading: Move old cache data to CPU memory or disk when GPU memory gets full.
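Here is the first strategy in toy form: a minimal NumPy sketch of 8-bit quantization of a cached K tensor. The shapes and the single-scale scheme are simplified for illustration; real implementations usually quantize per channel or per block.

```python
# Toy sketch of 8-bit KV-cache quantization (one scale per tensor):
# store the cache as int8 plus a scale, dequantize when attention reads it.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map an FP16/FP32 tensor to int8 with a single scale factor."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# A fake K tensor for one layer: (tokens, kv_heads, head_dim) in FP16.
k_cache = np.random.randn(2048, 8, 128).astype(np.float16)

q_cache, scale = quantize_int8(k_cache)
print("FP16 cache:", k_cache.nbytes / 1e6, "MB")   # ~4.2 MB
print("INT8 cache:", q_cache.nbytes / 1e6, "MB")   # ~2.1 MB (half the size)

# Dequantize when the cache is read during attention.
k_restored = dequantize_int8(q_cache, scale)
print("max abs error:", float(np.abs(k_restored - k_cache.astype(np.float32)).max()))
```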
Imagine you're running a bus service. You could send a separate bus for each passenger (inefficient), or you could group passengers going in the same direction and use one big bus (efficient batching)!
Instead of reserving huge chunks of memory for each user, PagedAttention divides memory into small "pages" and assigns them as needed - just like how your computer's operating system manages memory!
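A toy sketch of that bookkeeping: fixed-size blocks handed out on demand and tracked in a per-sequence block table. The block size and class below are illustrative, not vLLM's actual internals.

```python
# Toy block allocator in the spirit of PagedAttention: each sequence gets a
# list of fixed-size "pages" (blocks) on demand instead of one big reservation.
BLOCK_SIZE = 16  # tokens per block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # physical block IDs
        self.block_tables = {}                       # seq_id -> [block IDs]

    def append_token(self, seq_id: str, num_tokens_so_far: int):
        """Grab a new physical block only when the current one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:      # current block full (or first token)
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=1024)
for t in range(40):                       # a 40-token sequence...
    alloc.append_token("request-1", t)
print(alloc.block_tables["request-1"])    # ...uses only ceil(40/16) = 3 blocks
```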
Imagine a factory where one team is really good at preparing ingredients (prefill) and another team excels at final assembly (decode). Instead of having each worker do both tasks poorly, you separate them into specialized stations for maximum efficiency!
Prefill: Needs lots of compute, short burst
Decode: Needs consistent memory access, long duration
Together: They fight for resources and slow each other down!
Prefill workers: Specialized for parallel processing
Optimized for TTFT (time to first token)
KV cache handoff: High-speed network
~17ms for 2048 tokens
Decode workers: Specialized for sequential generation
Optimized for throughput
Prefill doesn't interfere with decode operations. Users get consistent response times.
Use different GPU configurations optimized for each phase's specific needs.
Up to 7x higher request rates with the same SLA requirements.
Scale prefill and decode independently based on actual demand patterns.
🏢 OpenAI and Google use disaggregated serving!
For ChatGPT: Prefill cluster handles your prompt, then hands off to decode cluster for streaming response generation.
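A toy sketch of the handoff pattern: a prefill stage builds the KV cache and ships it to a separate decode stage. Plain Python queues stand in for the high-speed interconnect here, purely to show the flow.

```python
# Toy illustration of disaggregated serving: a prefill stage builds the KV
# cache, then a separate decode stage streams tokens from it. In production the
# two stages run on different GPU pools and the cache moves over a fast
# interconnect; here a plain queue stands in for that transfer.
import queue
import threading

handoff = queue.Queue()   # stands in for the high-speed KV-cache transfer

def prefill_worker(requests):
    for req_id, prompt in requests:
        kv_cache = {"prompt": prompt, "len": len(prompt.split())}  # fake cache
        handoff.put((req_id, kv_cache))          # ship cache to the decode pool
    handoff.put(None)                            # signal: no more requests

def decode_worker():
    while (item := handoff.get()) is not None:
        req_id, kv_cache = item
        for step in range(3):                    # pretend to stream 3 tokens
            print(f"{req_id}: token {step} (cache len={kv_cache['len']})")

requests = [("req-1", "Explain KV caching"), ("req-2", "What is prefill?")]
t1 = threading.Thread(target=prefill_worker, args=(requests,))
t2 = threading.Thread(target=decode_worker)
t1.start(); t2.start(); t1.join(); t2.join()
```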
Imagine a chess master playing against a computer. The master can quickly think of several good moves (draft), then the computer carefully verifies which ones are actually legal and best (verification). This way, multiple moves can be planned in the time it usually takes to plan one!
Draft: A small, fast model generates 3-5 candidate tokens
Verify: The large target model checks all candidates in parallel
Accept/Reject: Keep the good predictions, discard the bad ones
Separate draft model: Use a smaller model from the same family (e.g., 7B drafting for 70B)
Speed: 2-3x faster
Layer skipping (self-speculative): Use the same model with some layers skipped for drafting
Speed: 1.5-2x faster
Multiple prediction heads (Medusa-style): Add extra prediction heads to the main model
Speed: 2-3x faster
Prompt lookup: Reuse tokens that already appeared in the prompt
Speed: Very context-dependent
Current context: "The capital of France is"
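A toy illustration of the accept/reject logic on this context. Both "models" below are hard-coded stand-ins (a real verifier scores all drafted positions in a single parallel forward pass), so the point is only the control flow:

```python
# Toy speculative decoding: a cheap "draft" proposes several tokens, the
# expensive "target" model checks them, and we keep the longest verified prefix
# plus one corrected token. Both models are faked with hard-coded continuations
# purely to show how acceptance and rejection work.
DRAFT_GUESS = ["Paris", ",", "the", "biggest"]        # draft's 4 candidate tokens

def target_next_token(context):
    """Stand-in for the big model: what it would actually emit next."""
    truth = ["Paris", ",", "the", "capital", "city"]
    return truth[len(context)] if len(context) < len(truth) else "<eos>"

accepted = []
for guess in DRAFT_GUESS:
    correct = target_next_token(accepted)
    if guess == correct:
        accepted.append(guess)                        # verified: keep it
    else:
        accepted.append(correct)                      # mismatch: take target's token
        break                                         # everything after is discarded

print("Prompt: The capital of France is")
print("Tokens gained from one verification pass:", accepted)
# ['Paris', ',', 'the', 'capital'] -> 4 tokens for the price of ~1 target step
```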
📈 Google AI Overviews uses speculative decoding for 2-4x speedup
⚡ Perfect for scenarios where the GPU is underutilized (small batch sizes)
🎯 Best results when the draft model has a 60-80% acceptance rate
Think of GGUF files like mobile apps that are optimized to run on your phone instead of requiring a powerful desktop computer. They're compressed and efficient versions of the full models that can run on consumer hardware!
G: Georgi (creator's name)
G: Gerganov (creator's surname)
U: Universal
F: File format
A special file format that stores AI models in a compressed, CPU-friendly way!
4-bit quantized (Q4):
Size: ~40GB (70B model)
Quality: Good balance
Speed: Fast
8-bit quantized (Q8):
Size: ~75GB (70B model)
Quality: High quality
Speed: Moderate
Full precision (FP16):
Size: ~140GB (70B model)
Quality: Original quality
Speed: Slower
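These sizes are roughly parameters × bits-per-weight. A quick estimate (the bits-per-weight values are approximate, and real GGUF files add metadata and keep a few tensors in higher precision):

```python
# Rough GGUF file-size estimate: parameters * bits-per-weight / 8.
# Treat the results as ballpark figures, not exact file sizes.
params = 70e9                       # 70B-parameter model

for label, bits in [("4-bit", 4.5), ("8-bit", 8.5), ("FP16", 16.0)]:
    size_gb = params * bits / 8 / 1e9
    print(f"{label:>6}: ~{size_gb:.0f} GB")
# ~39 GB, ~74 GB, ~140 GB respectively
```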
Pull GGUF model from Hugging Face or Ollama registry
Load into system memory (RAM/GPU)
Start chatting with OpenAI-compatible API
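For example, once the model is pulled and the Ollama server is running on its default port, you can call it from Python through the OpenAI-compatible endpoint (the model tag below is just an example):

```python
# Minimal chat request against a local Ollama server via its OpenAI-compatible
# API. Assumes the model has already been pulled and the server is listening on
# its default port (11434); the model name is an example.
import json
import urllib.request

payload = {
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Explain KV caching in one sentence."}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])
```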
Full-precision GPU serving:
• Requires GPU with 80GB+ VRAM
• Full 16-bit precision weights
• Complex serving infrastructure
• Size: ~140GB for 70B model
Ollama with a GGUF model:
• Runs on a laptop with enough RAM to hold the model file (smaller models fit comfortably in 32GB)
• Quantized 4-bit weights
• Simple single-command setup
• Size: ~40GB for 70B model
💻 CPU Inference: 1-5 tokens/second on M2 MacBook
🖥️ GPU Inference: 10-50 tokens/second on RTX 4090
🧠 Memory Usage: Model size + 2-4GB for context
Different inference servers are like different types of racing cars. A Formula 1 car (TensorRT-LLM) is fastest on a professional track, a rally car (vLLM) works great in various conditions, and a family sedan (TGI) is reliable and easy to drive everywhere!
vLLM:
Best for: Research, fast TTFT
Strengths: PagedAttention, easy setup
Hardware: NVIDIA, AMD, Intel
Weaknesses: Newer, fewer enterprise features
TensorRT-LLM:
Best for: Maximum NVIDIA GPU performance
Strengths: Fastest on NVIDIA, FP8 support
Hardware: NVIDIA only
Weaknesses: Complex setup, compilation needed
NVIDIA Triton:
Best for: Enterprise, multi-framework
Strengths: Mature, supports any model
Hardware: All major platforms
Weaknesses: Complex configuration
Text Generation Inference (TGI):
Best for: HuggingFace ecosystem, beginners
Strengths: Easy setup, good docs
Hardware: Broad support
Weaknesses: Not always fastest
Running only on NVIDIA GPUs and need maximum performance?
YES → TensorRT-LLM
NO → Continue...
Want easy setup for research or the HuggingFace ecosystem?
YES → vLLM or TGI
NO → Continue...
Need enterprise-grade, multi-framework serving?
YES → Triton
NO → vLLM
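As a concrete starting point, vLLM's offline API takes only a few lines; the model tag and sampling settings below are placeholders, so adjust them to whatever fits your hardware:

```python
# Minimal vLLM offline-inference sketch. The model name and sampling settings
# are placeholders; any HF-format causal LM that fits your GPU should work.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # example model tag
sampling = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain the KV cache in two sentences."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```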
🔥 Production at Scale: Most companies use multiple servers!
📊 Example: TensorRT-LLM for high-throughput batch inference + vLLM for interactive chat
📈 Trend: Moving toward disaggregated architectures with specialized servers for prefill vs decode